2017-03-21

Warmups - Problem 1

A 10-week sensory experiment: 12 individuals assessed the taste of french fries on several scales (how potato-y, buttery, grassy, rancid, paint-y do they taste?). The fries were fried in one of 3 different oils, and each tasting was replicated twice. First few rows:

time treatment subject rep potato buttery grassy rancid painty
1 1 3 1 2.9 0.0 0.0 0.0 5.5
1 1 3 2 14.0 0.0 0.0 1.1 0.0
1 1 10 1 11.0 6.4 0.0 0.0 0.0
1 1 10 2 9.9 5.9 2.9 2.2 0.0

What do you want to know?

Warmups - Problem 2

What's in the column names of this data?

id WI-6.R1 WI-6.R2 WI-6.R4 WM-6.R1 WM-6.R2 WI-12.R1 WI-12.R2 WI-12.R4 WM-12.R1 WM-12.R2 WM-12.R4
Gene 1 2.2 2.20 4.2 2.63 5.1 4.5 5.5 4.4 3.9 4.2 3.73
Gene 2 1.5 0.59 1.9 0.52 2.9 1.4 3.0 1.3 1.2 1.2 0.89
Gene 3 2.0 0.87 3.3 0.53 4.6 2.2 5.6 2.5 2.5 3.0 1.35

Warmups - Problem 3

What are the variables? What are the records?

library(knitr)  # provides kable()
melbtemp <- read.fwf("../data/ASN00086282.dly", 
   c(11, 4, 2, 4, rep(c(5, 1, 1, 1), 31)), fill=TRUE)
kable(head(melbtemp[,c(1,2,3,4,seq(5,128,4))]))
V1 V2 V3 V4 V5 V9 V13 V17 V21 V25 V29 V33 V37 V41 V45 V49 V53 V57 V61 V65 V69 V73 V77 V81 V85 V89 V93 V97 V101 V105 V109 V113 V117 V121 V125
ASN00086282 1970 7 TMAX 141 124 113 123 148 149 139 153 123 108 119 112 126 112 115 133 134 126 104 143 141 134 117 142 158 149 133 143 150 145 115
ASN00086282 1970 7 TMIN 80 63 36 57 69 47 84 78 49 42 48 56 51 36 44 39 40 58 15 33 51 74 39 66 78 36 61 46 42 63 39
ASN00086282 1970 7 PRCP 3 30 0 0 36 3 0 0 10 23 3 0 5 0 0 0 0 0 8 0 18 0 0 0 0 13 3 0 25 0 3
ASN00086282 1970 8 TMAX 145 128 150 122 109 112 116 142 166 127 117 127 159 143 114 65 113 125 129 147 161 168 178 161 145 142 137 150 120 114 129
ASN00086282 1970 8 TMIN 50 61 75 67 41 51 48 -7 56 62 47 33 67 84 11 41 18 50 22 28 74 94 73 88 50 48 54 78 47 18 39
ASN00086282 1970 8 PRCP 0 66 0 53 13 3 8 0 0 0 3 5 0 0 64 3 99 36 8 0 0 0 8 36 25 30 56 5 69 3 20

Warmups - Problem 4

What are the variables? What are the observations?

pew <- read.delim(
  file = "http://stat405.had.co.nz/data/pew.txt",
  header = TRUE,
  stringsAsFactors = FALSE,
  check.names = FALSE
)
kable(pew[1:5, 1:5])
religion <$10k $10-20k $20-30k $30-40k
Agnostic 27 34 60 81
Atheist 12 27 37 52
Buddhist 27 21 30 34
Catholic 418 617 732 670
Don’t know/refused 15 14 15 11

What we are going to cover today

  • Reading different data formats
  • Tidying data
  • Split - apply - combine
  • Working with dates
  • Plotting your data

French fries - hot chips

A 10-week sensory experiment: 12 individuals assessed the taste of french fries on several scales (how potato-y, buttery, grassy, rancid, paint-y do they taste?). The fries were fried in one of 3 different oils, and each tasting was replicated twice. First few rows:

time treatment subject rep potato buttery grassy rancid painty
1 1 3 1 2.9 0.0 0.0 0.0 5.5
1 1 3 2 14.0 0.0 0.0 1.1 0.0
1 1 10 1 11.0 6.4 0.0 0.0 0.0
1 1 10 2 9.9 5.9 2.9 2.2 0.0
1 1 15 1 1.2 0.1 0.0 1.1 5.1
1 1 15 2 8.8 3.0 3.6 1.5 2.3

What would we like to know?

  • Is the design complete?
  • Are replicates like each other?
  • How do the ratings on the different scales differ?
  • Are raters giving different scores on average?
  • Do chips taste more rancid over the weeks?

Each of these questions involves different summaries of the data.

What we have and what we want

Gathering

  • When gathering, you need to specify the keys (identifiers) and the values (measures).

Keys/Identifiers:

  • Identify a record (must be unique)
  • Example: indices on a random variable
  • Fixed by design of experiment (known in advance)
  • May be single or composite (may have one or more variables)

Values/Measures:

  • Collected during the experiment (not known in advance)
  • Usually numeric quantities
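The "must be unique" requirement on keys can be checked directly. The sketch below is one possible check in base R, on made-up toy data; the helper name is_key is hypothetical and not part of tidyr.

```r
# Toy data (made up for illustration): two subjects, two reps, one measure.
ratings <- data.frame(
  subject = c(3, 3, 10, 10),
  rep     = c(1, 2, 1, 2),
  potato  = c(2.9, 14.0, 11.0, 9.9)
)

# Hypothetical helper: a set of columns is a valid key only if
# no combination of their values appears more than once.
is_key <- function(df, key_cols) {
  !any(duplicated(df[, key_cols, drop = FALSE]))
}

is_key(ratings, "subject")            # FALSE: each subject appears twice
is_key(ratings, c("subject", "rep"))  # TRUE: subject + rep is unique
```

Here subject alone is not a key, but the composite of subject and rep is, which is what "may be single or composite" means in practice.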

Gathering the French Fries

library(tidyr)
ff_long <- gather(french_fries, key = variable, value = rating, potato:painty)
head(ff_long)
#>   time treatment subject rep variable rating
#> 1    1         1       3   1   potato    2.9
#> 2    1         1       3   2   potato   14.0
#> 3    1         1      10   1   potato   11.0
#> 4    1         1      10   2   potato    9.9
#> 5    1         1      15   1   potato    1.2
#> 6    1         1      15   2   potato    8.8

Long to Wide

In certain applications, we may wish to take a long dataset and convert it to a wide dataset (perhaps for display in a table).

This is called "spreading" the data.

Spread

We use the spread function from tidyr to do this:

french_fries_wide <- spread(ff_long, key = variable, value = rating)

head(french_fries_wide)
#>   time treatment subject rep buttery grassy painty potato rancid
#> 1    1         1       3   1     0.0    0.0    5.5    2.9    0.0
#> 2    1         1       3   2     0.0    0.0    0.0   14.0    1.1
#> 3    1         1      10   1     6.4    0.0    0.0   11.0    0.0
#> 4    1         1      10   2     5.9    2.9    0.0    9.9    2.2
#> 5    1         1      15   1     0.1    0.0    5.1    1.2    1.1
#> 6    1         1      15   2     3.0    3.6    2.3    8.8    1.5
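In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The sketch below shows what the same round trip might look like with the newer functions; ff_toy is a small stand-in built from the first rows shown earlier, not the full french_fries data.

```r
library(tidyr)

# Toy stand-in for french_fries (only two of the five scales, four rows).
ff_toy <- data.frame(
  time = 1, treatment = 1,
  subject = c(3, 3, 10, 10),
  rep     = c(1, 2, 1, 2),
  potato  = c(2.9, 14.0, 11.0, 9.9),
  buttery = c(0.0, 0.0, 6.4, 5.9)
)

# Counterpart of gather(): stack the measure columns into key/value pairs.
toy_long <- pivot_longer(ff_toy, cols = potato:buttery,
                         names_to = "variable", values_to = "rating")

# Counterpart of spread(): one column per value of `variable` again.
toy_wide <- pivot_wider(toy_long, names_from = variable, values_from = rating)

dim(toy_long)  # 8 rows, 6 columns
dim(toy_wide)  # 4 rows, 6 columns
```

The older functions still work, so the rest of these slides can be run unchanged.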

Let's use gather and spread to answer some questions

The easiest question to start with is whether the ratings are similar on the different scales: potato-y, buttery, grassy, rancid and painty.

We need to gather the data into long form, and make plots faceted by the scale.

Ratings on the different scales

library(ggplot2)
ggplot(data=ff_long, aes(x=rating)) + 
  geom_histogram(binwidth=2) + 
  facet_wrap(~variable, ncol=5) 

Side-by-side boxplots

ggplot(data=ff_long, aes(x=variable, y=rating)) + 
  geom_boxplot()

Do the replicates look like each other?

We will start to tackle this by plotting the replicates against each other using a scatterplot.

We need to gather the data into long form, and then get the replicates spread into separate columns.

Check replicates

ff.s <- ff_long %>% spread(rep, rating)
head(ff.s)
#>   time treatment subject variable   1    2
#> 1    1         1       3  buttery 0.0  0.0
#> 2    1         1       3   grassy 0.0  0.0
#> 3    1         1       3   painty 5.5  0.0
#> 4    1         1       3   potato 2.9 14.0
#> 5    1         1       3   rancid 0.0  1.1
#> 6    1         1      10  buttery 6.4  5.9

Check replicates

p <- ggplot(data=ff.s, aes(x=`1`, y=`2`)) + 
  geom_point(alpha=0.5) +
  theme(aspect.ratio=1) + 
  xlab("Rep 1") + ylab("Rep 2")
p
p + scale_x_log10() + scale_y_log10()

Your turn

Make scatterplots of the reps against each other, separately for each scale and treatment.

Are raters giving different scores on average?

ggplot(ff_long, aes(x=subject, y=rating)) + 
  geom_boxplot() + facet_wrap(~variable)

OR

ggplot(ff_long, aes(x=variable, y=rating)) + 
  geom_boxplot() + facet_wrap(~subject)

Legos = tidy data

(Courtesy of Hadley Wickham)

Playmobil = messy data

(Courtesy of Hadley Wickham)

Your turn

Read in the Billboard top 100 music data, which contains N'Sync and Backstreet Boys songs that entered the Billboard charts in the year 2000.

billboard <- read.csv("../data/billboard.csv")

What's in this data? What's X1-X76?

Your turn

  1. Use tidyr to convert this data into a long format appropriate for plotting a time series (date on the x axis, chart position on the y axis)
  2. Use ggplot2 to create this time series plot:

The Split-Apply-Combine Approach

(Diagram originally from Hadley Wickham)

Split-Apply-Combine in dplyr

library(dplyr)
french_fries_split <- group_by(ff_long, variable) # SPLIT
french_fries_apply <- summarise(french_fries_split, rating = mean(rating, na.rm = TRUE)) # APPLY + COMBINE
french_fries_apply
#> # A tibble: 5 × 2
#>   variable rating
#>      <chr>  <dbl>
#> 1  buttery   1.82
#> 2   grassy   0.66
#> 3   painty   2.52
#> 4   potato   6.95
#> 5   rancid   3.85
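To make the three stages explicit, here is a rough base R sketch of the same idea on made-up toy ratings (not the real french_fries values). dplyr does the equivalent work behind group_by() and summarise().

```r
# Toy long-format ratings (made-up numbers).
toy <- data.frame(
  variable = c("potato", "potato", "rancid", "rancid", "rancid"),
  rating   = c(8, 6, 1, NA, 3)
)

pieces <- split(toy$rating, toy$variable)      # SPLIT: one vector per scale
means  <- lapply(pieces, mean, na.rm = TRUE)   # APPLY: summarise each piece
result <- data.frame(                          # COMBINE: back into one table
  variable = names(means),
  rating   = unlist(means, use.names = FALSE),
  stringsAsFactors = FALSE
)
result
```

With these toy numbers the potato mean is 7 and the rancid mean is 2 (the NA is dropped by na.rm = TRUE).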

The pipe operator

  • dplyr allows us to chain together these data analysis tasks using the %>% (pipe) operator
  • x %>% f(y) is shorthand for f(x, y)
  • Example:
french_fries %>%
    gather(key = variable, value = rating, potato:painty) %>%
    group_by(variable) %>%
    summarise(rating = mean(rating, na.rm = TRUE))
#> # A tibble: 5 × 2
#>   variable rating
#>      <chr>  <dbl>
#> 1  buttery   1.82
#> 2   grassy   0.66
#> 3   painty   2.52
#> 4   potato   6.95
#> 5   rancid   3.85
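A tiny sketch of the equivalence, using a plain numeric vector so that nothing depends on the french fries data: the nested call and the piped chain compute the same thing, but the piped version reads in the order the steps happen.

```r
library(dplyr)  # makes %>% available

x <- c(1, 4, 9)

# Nested form: read from the inside out.
nested <- round(mean(sqrt(x)), 1)

# Piped form: read left to right.
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)  # TRUE
```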

dplyr verbs

There are five primary dplyr verbs, representing distinct data analysis tasks:

  • Filter: Keep only the rows of a data frame that satisfy a condition, producing subsets
  • Arrange: Reorder the rows of a data frame
  • Select: Select particular columns of a data frame
  • Mutate: Add new columns that are functions of existing columns
  • Summarise: Create collapsed summaries of a data frame

Filter

french_fries %>%
    filter(subject == 3, time == 1)
#>   time treatment subject rep potato buttery grassy rancid painty
#> 1    1         1       3   1    2.9     0.0    0.0    0.0    5.5
#> 2    1         1       3   2   14.0     0.0    0.0    1.1    0.0
#> 3    1         2       3   1   13.9     0.0    0.0    3.9    0.0
#> 4    1         2       3   2   13.4     0.1    0.0    1.5    0.0
#> 5    1         3       3   1   14.1     0.0    0.0    1.1    0.0
#> 6    1         3       3   2    9.5     0.0    0.6    2.8    0.0

Arrange

french_fries %>%
    arrange(desc(rancid)) %>%
    head
#>   time treatment subject rep potato buttery grassy rancid painty
#> 1    9         2      51   1    7.3     2.3      0     15    0.1
#> 2   10         1      86   2    0.7     0.0      0     14   13.1
#> 3    5         2      63   1    4.4     0.0      0     14    0.6
#> 4    9         2      63   1    1.8     0.0      0     14   12.3
#> 5    5         2      19   2    5.5     4.7      0     13    4.6
#> 6    4         3      63   1    5.6     0.0      0     13    4.4

Select

french_fries %>%
    select(time, treatment, subject, rep, potato) %>%
    head
#>    time treatment subject rep potato
#> 61    1         1       3   1    2.9
#> 25    1         1       3   2   14.0
#> 62    1         1      10   1   11.0
#> 26    1         1      10   2    9.9
#> 63    1         1      15   1    1.2
#> 27    1         1      15   2    8.8
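Mutate is the one verb in the list without its own worked example, so here is a minimal sketch on a toy data frame. The new column names (potato_minus_rancid, high_potato) and the cut-off of 10 are made up for illustration.

```r
library(dplyr)

# Toy ratings (three rows, two scales).
toy <- data.frame(
  subject = c(3, 3, 10),
  potato  = c(2.9, 14.0, 11.0),
  rancid  = c(0.0, 1.1, 0.0)
)

# mutate() adds columns computed from existing ones; rows are unchanged.
toy_more <- toy %>%
  mutate(
    potato_minus_rancid = potato - rancid,
    high_potato = potato > 10
  )
toy_more
```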

Summarise

french_fries %>%
    group_by(time, treatment) %>%
    summarise(mean_rancid = mean(rancid), sd_rancid = sd(rancid))
#> Source: local data frame [30 x 4]
#> Groups: time [?]
#> 
#>      time treatment mean_rancid sd_rancid
#>    <fctr>    <fctr>       <dbl>     <dbl>
#> 1       1         1         2.8       3.2
#> 2       1         2         1.7       2.7
#> 3       1         3         2.6       3.2
#> 4       2         1         3.9       4.4
#> 5       2         2         2.1       3.1
#> 6       2         3         2.5       3.4
#> 7       3         1         4.7       3.9
#> 8       3         2         2.9       3.8
#> 9       3         3         3.6       3.6
#> 10      4         1         2.1       2.4
#> # ... with 20 more rows

Let's use these tools to answer the rest of the french fries questions

If the data is complete it should have 12 x 10 x 3 x 2 = 720 records, that is, 3 x 2 = 6 records for each person at each time point. (Assuming that each person rated on all scales.)

To check this we want to tabulate the number of records for each subject at each time. This means selecting the appropriate columns, counting the records, and spreading the counts out to give a nice table.
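The arithmetic can be sketched in base R by building the full design and then knocking out one cell. Subject ids 1 to 12 and the missing week are placeholders chosen for illustration, not the real ids or the real gaps.

```r
# Full design: 12 subjects x 10 weeks x 3 oils x 2 reps.
design <- expand.grid(subject = 1:12, time = 1:10, treatment = 1:3, rep = 1:2)
nrow(design)  # 720 records if nothing is missing

# Hypothetical gap: subject 1 misses week 10 entirely.
observed <- design[!(design$subject == 1 & design$time == 10), ]

# Records per subject and week; a complete cell holds 3 * 2 = 6.
counts <- table(subject = observed$subject, time = observed$time)
which(counts != 6, arr.ind = TRUE)  # locates the incomplete cell
```

Note that table() reports an empty cell as 0, whereas count() followed by spread() reports it as NA.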

Check completeness

french_fries %>% 
  select(subject, time, treatment) %>% 
  tbl_df() %>% 
  count(subject, time) %>%
  spread(time, n)
#> Source: local data frame [12 x 11]
#> Groups: subject [12]
#> 
#>    subject   `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`
#> *   <fctr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1        3     6     6     6     6     6     6     6     6     6    NA
#> 2       10     6     6     6     6     6     6     6     6     6     6
#> 3       15     6     6     6     6     6     6     6     6     6     6
#> 4       16     6     6     6     6     6     6     6     6     6     6
#> 5       19     6     6     6     6     6     6     6     6     6     6
#> 6       31     6     6     6     6     6     6     6     6    NA     6
#> 7       51     6     6     6     6     6     6     6     6     6     6
#> 8       52     6     6     6     6     6     6     6     6     6     6
#> 9       63     6     6     6     6     6     6     6     6     6     6
#> 10      78     6     6     6     6     6     6     6     6     6     6
#> 11      79     6     6     6     6     6     6     6     6     6    NA
#> 12      86     6     6     6     6     6     6     6     6    NA     6

Check completeness with different scales, too

french_fries %>% 
  gather(type, rating, -subject, -time, -treatment, -rep) %>%
  select(subject, time, treatment, type) %>% 
  tbl_df() %>% 
  count(subject, time) %>%
  spread(time, n)
#> Source: local data frame [12 x 11]
#> Groups: subject [12]
#> 
#>    subject   `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`
#> *   <fctr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1        3    30    30    30    30    30    30    30    30    30    NA
#> 2       10    30    30    30    30    30    30    30    30    30    30
#> 3       15    30    30    30    30    30    30    30    30    30    30
#> 4       16    30    30    30    30    30    30    30    30    30    30
#> 5       19    30    30    30    30    30    30    30    30    30    30
#> 6       31    30    30    30    30    30    30    30    30    NA    30
#> 7       51    30    30    30    30    30    30    30    30    30    30
#> 8       52    30    30    30    30    30    30    30    30    30    30
#> 9       63    30    30    30    30    30    30    30    30    30    30
#> 10      78    30    30    30    30    30    30    30    30    30    30
#> 11      79    30    30    30    30    30    30    30    30    30    NA
#> 12      86    30    30    30    30    30    30    30    30    NA    30

Change in rancid ratings over weeks

p <- ff_long %>% 
  filter(variable == "rancid") %>%
  ggplot(aes(x=time, y=rating, colour=treatment)) + 
  geom_point(alpha=0.5) +
  facet_wrap(~subject) 
p

Add means over reps, and connect the dots

ff.av <- ff_long %>% 
  filter(variable == "rancid") %>%
  group_by(subject, time, treatment) %>%
  summarise(rating=mean(rating))
p + geom_line(data=ff.av, aes(group=treatment))

String manipulation

When the experimental design is packed into column names, we need to unpack it.

library(readr)  # provides read_csv()
genes <- read_csv("../data/genes.csv")
genes
#> # A tibble: 3 × 12
#>       id `WI-6.R1` `WI-6.R2` `WI-6.R4` `WM-6.R1` `WM-6.R2` `WI-12.R1`
#>    <chr>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
#> 1 Gene 1       2.2      2.20       4.2      2.63       5.1        4.5
#> 2 Gene 2       1.5      0.59       1.9      0.52       2.9        1.4
#> 3 Gene 3       2.0      0.87       3.3      0.53       4.6        2.2
#> # ... with 5 more variables: `WI-12.R2` <dbl>, `WI-12.R4` <dbl>,
#> #   `WM-12.R1` <dbl>, `WM-12.R2` <dbl>, `WM-12.R4` <dbl>

Gather column names into long form

gather(genes, variable, expr, -id)
#> # A tibble: 33 × 3
#>        id variable  expr
#>     <chr>    <chr> <dbl>
#> 1  Gene 1  WI-6.R1  2.18
#> 2  Gene 2  WI-6.R1  1.46
#> 3  Gene 3  WI-6.R1  2.03
#> 4  Gene 1  WI-6.R2  2.20
#> 5  Gene 2  WI-6.R2  0.59
#> 6  Gene 3  WI-6.R2  0.87
#> 7  Gene 1  WI-6.R4  4.20
#> 8  Gene 2  WI-6.R4  1.86
#> 9  Gene 3  WI-6.R4  3.28
#> 10 Gene 1  WM-6.R1  2.63
#> # ... with 23 more rows

Separate columns

genes %>%
  gather(variable, expr, -id) %>%
  separate(variable, c("trt", "leftover"), "-")
#> # A tibble: 33 × 4
#>        id   trt leftover  expr
#> *   <chr> <chr>    <chr> <dbl>
#> 1  Gene 1    WI     6.R1  2.18
#> 2  Gene 2    WI     6.R1  1.46
#> 3  Gene 3    WI     6.R1  2.03
#> 4  Gene 1    WI     6.R2  2.20
#> 5  Gene 2    WI     6.R2  0.59
#> 6  Gene 3    WI     6.R2  0.87
#> 7  Gene 1    WI     6.R4  4.20
#> 8  Gene 2    WI     6.R4  1.86
#> 9  Gene 3    WI     6.R4  3.28
#> 10 Gene 1    WM     6.R1  2.63
#> # ... with 23 more rows

genes %>%
  gather(variable, expr, -id) %>%
  separate(variable, c("trt", "leftover"), "-") %>%
  separate(leftover, c("time", "rep"), "\\.")
#> # A tibble: 33 × 5
#>        id   trt  time   rep  expr
#> *   <chr> <chr> <chr> <chr> <dbl>
#> 1  Gene 1    WI     6    R1  2.18
#> 2  Gene 2    WI     6    R1  1.46
#> 3  Gene 3    WI     6    R1  2.03
#> 4  Gene 1    WI     6    R2  2.20
#> 5  Gene 2    WI     6    R2  0.59
#> 6  Gene 3    WI     6    R2  0.87
#> 7  Gene 1    WI     6    R4  4.20
#> 8  Gene 2    WI     6    R4  1.86
#> 9  Gene 3    WI     6    R4  3.28
#> 10 Gene 1    WM     6    R1  2.63
#> # ... with 23 more rows

gtidy <- genes %>%
  gather(variable, expr, -id) %>%
  separate(variable, c("trt", "leftover"), "-") %>%
  separate(leftover, c("time", "rep"), "\\.") %>%
  mutate(trt = sub("W", "", trt)) %>%
  mutate(rep = sub("R", "", rep))
gtidy
#> # A tibble: 33 × 5
#>        id   trt  time   rep  expr
#>     <chr> <chr> <chr> <chr> <dbl>
#> 1  Gene 1     I     6     1  2.18
#> 2  Gene 2     I     6     1  1.46
#> 3  Gene 3     I     6     1  2.03
#> 4  Gene 1     I     6     2  2.20
#> 5  Gene 2     I     6     2  0.59
#> 6  Gene 3     I     6     2  0.87
#> 7  Gene 1     I     6     4  4.20
#> 8  Gene 2     I     6     4  1.86
#> 9  Gene 3     I     6     4  3.28
#> 10 Gene 1     M     6     1  2.63
#> # ... with 23 more rows
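The two-step split works because each delimiter is handled in turn. As an alternative sketch, a character class can split on both delimiters at once; the version below uses only base R and three of the column names from the genes data, so it is an illustration of the string handling rather than a replacement for the pipeline above.

```r
# Three of the column names from the genes data.
nms <- c("WI-6.R1", "WI-6.R2", "WM-12.R4")

# Split on either "-" or "." (inside [] the dot is a literal dot).
parts <- do.call(rbind, strsplit(nms, "[-.]"))

design <- data.frame(
  trt  = sub("W", "", parts[, 1]),             # "WI" -> "I"
  time = as.integer(parts[, 2]),               # "6"  -> 6
  rep  = as.integer(sub("R", "", parts[, 3])), # "R1" -> 1
  stringsAsFactors = FALSE
)
design
```

Converting time and rep to integers is a choice made here; gtidy keeps them as character columns.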

Your turn

  1. Using the tidied dataset (gtidy), find the mean expression for each combination of id, trt, and time.
  2. Use this tidied data to make this plot.

Dates and Times

Dates are deceptively hard to work with in R.

Example: 02/05/2012. Is it February 5th, or May 2nd?

Other things are difficult too:

  • Time zones
  • POSIXct format in base R is challenging

The lubridate package helps tackle some of these issues.
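The ambiguity in 02/05/2012 is easy to demonstrate with base R, where the format string has to state the order explicitly. lubridate's mdy() and dmy() make the same choice through the function name.

```r
x <- "02/05/2012"

as_month_first <- as.Date(x, format = "%m/%d/%Y")
as_day_first   <- as.Date(x, format = "%d/%m/%Y")

as_month_first  # "2012-02-05" (February 5th)
as_day_first    # "2012-05-02" (May 2nd)
```

The same string gives two valid dates almost three months apart, and neither call raises a warning, which is why the intended order has to come from knowledge of the data source.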

Basic Lubridate Use

library(lubridate)

now()
#> [1] "2017-03-21 14:25:40 AEDT"
today()
#> [1] "2017-03-21"
now() + hours(4)
#> [1] "2017-03-21 18:25:40 AEDT"
today() - days(2)
#> [1] "2017-03-19"

Parsing Dates

ymd("2013-05-14")
#> [1] "2013-05-14"
mdy("05/14/2013")
#> [1] "2013-05-14"
dmy("14052013")
#> [1] "2013-05-14"
ymd_hms("2013:05:14 14:5:30", tz = "America/New_York")
#> [1] "2013-05-14 14:05:30 EDT"

Flight Dates

library(nycflights13)
head(flights)
#> # A tibble: 6 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>
#> 1  2013     1     1      517            515         2      830
#> 2  2013     1     1      533            529         4      850
#> 3  2013     1     1      542            540         2      923
#> 4  2013     1     1      544            545        -1     1004
#> 5  2013     1     1      554            600        -6      812
#> 6  2013     1     1      554            558        -4      740
#> # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>
f <- flights %>%
  mutate(date = paste(year, month, day, sep = "-")) %>%
  mutate(date2 = ymd(date)) %>%
  select(date, date2)

Your turn

What's the difference between date and date2?

Use date to plot over time

ff_day <- flights %>% 
  mutate(date = paste(year, month, day, sep = "-")) %>%
  mutate(date = ymd(date)) %>%
  group_by(date) %>%
  tally()
ggplot(ff_day, aes(x=date, y=n)) + geom_line()

By day of the week

ff_day <- ff_day %>%
  mutate(day=wday(date, label = TRUE, abbr = TRUE))
ggplot(ff_day, aes(x=day, y=n)) + geom_boxplot()

Hours and minutes

Use the ymd_hms() function to make a date time that incorporates hours and minutes (bonus: why can't ymd_hms() parse every date time?)

Date times

flights <- flights %>%
  mutate(date = paste(year, month, day, sep = "-")) %>%
  mutate(time = paste(hour, minute, "0", sep = ":")) %>%
  mutate(dt = ymd_hms(paste(date, time))) 
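When the components are already numeric columns, pasting them into a string and re-parsing is one option; another is lubridate's make_datetime(), which builds the date-time from the numbers directly. A minimal sketch on a made-up two-row schedule (the sched data frame is hypothetical, with the same column names as flights):

```r
library(lubridate)

# Made-up schedule with the same component columns as flights.
sched <- data.frame(
  year   = c(2013L, 2013L),
  month  = c(1L, 7L),
  day    = c(1L, 15L),
  hour   = c(5, 23),
  minute = c(15, 59)
)

# Build POSIXct values straight from the numeric parts (UTC by default).
sched$dt <- make_datetime(sched$year, sched$month, sched$day,
                          sched$hour, sched$minute)
format(sched$dt, "%Y-%m-%d %H:%M")
```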

Fitting models across groups

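If your apply function returns anything but a single value (here, a fitted model for each carrier), summarise() cannot store it directly. Use do() instead, or split the data and map over the pieces with purrr, as below.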

library(purrr)
models <- flights %>%
  split(.$carrier) %>%
  map(~lm(dep_delay ~ hour, data = .)) %>%
  map_df(coefficients) %>% 
  t() %>% data.frame()
models
#>       X1     X2
#> 9E  -5.8   1.57
#> AA -11.3   1.57
#> AS -10.8   1.32
#> B6  -9.4   1.64
#> DL -11.2   1.54
#> EV  -8.0   2.12
#> F9 -15.5   2.59
#> FL -13.0   2.36
#> HA 242.8 -24.48
#> MQ  -8.3   1.40
#> OO  89.1  -4.47
#> UA  -7.9   1.55
#> US  -9.4   1.08
#> VX -17.8   2.48
#> WN -18.1   2.91
#> YV   4.3   0.95
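The same split, fit, and collect pattern can be sketched without any packages. The example below uses the built-in mtcars data, with cyl standing in for carrier and mpg ~ wt standing in for dep_delay ~ hour; it is an illustration of the pattern, not a re-analysis of the flights data.

```r
# mtcars ships with R; cyl plays the role of carrier here.
pieces <- split(mtcars, mtcars$cyl)                           # SPLIT by group
fits   <- lapply(pieces, function(d) lm(mpg ~ wt, data = d))  # APPLY: one lm per group
coefs  <- t(sapply(fits, coef))                               # COMBINE: one row per group
coefs
```

Each row of coefs holds the intercept and slope for one group, and the columns keep their coefficient names rather than X1 and X2.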

ggplot(flights, aes(x=hour, y=dep_delay)) + 
  geom_point(alpha=0.5) + 
  geom_smooth(method="lm", se=FALSE) +
  facet_wrap(~carrier)

Your turn

  • What does the intercept mean here?
  • Should you transform the departure delay?

Themes

library(ggthemes)
p + theme_tufte()

p + theme_economist()

Color palettes

p + scale_color_brewer("", palette = "Dark2")

Color blind-proofing

library(dichromat)
library(scales)
clrs <- hue_pal()(3)
p + scale_color_manual("", values=clrs) + theme(legend.position = "none")
clrs <- dichromat(hue_pal()(3))
p + scale_color_manual("", values=clrs) + theme(legend.position = "none")

library(RColorBrewer)
clrs <- brewer.pal(3, "Dark2")
p + scale_color_manual("", values=clrs) + theme(legend.position = "none")
clrs <- dichromat(brewer.pal(3, "Dark2"))
p + scale_color_manual("", values=clrs) + theme(legend.position = "none")

Color palettes

  • Qualitative: categorical variables
  • Sequential: low to high numeric values
  • Diverging: negative to positive values
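RColorBrewer has palettes of all three kinds. The sketch below picks one of each; the particular palettes (Dark2, Blues, RdBu) are just examples, and brewer.pal.info lists the rest along with their type.

```r
library(RColorBrewer)

qualitative <- brewer.pal(3, "Dark2")  # distinct hues for unordered categories
sequential  <- brewer.pal(5, "Blues")  # light to dark for low-to-high values
diverging   <- brewer.pal(5, "RdBu")   # two hues meeting at a neutral midpoint

# brewer.pal.info records each palette's type: "qual", "seq" or "div".
palette_types <- as.character(
  brewer.pal.info[c("Dark2", "Blues", "RdBu"), "category"]
)
palette_types
```

Each call returns a character vector of hex colours that can be passed to scale_color_manual(values = ...).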

Perceptual principles

  • Hierarchy of mappings: (first) position along an axis - (last) color (Cleveland, 1984; Heer and Bostock, 2009)
  • Pre-attentive: Some elements are noticed before you even realise it.
  • Color: (pre-attentive) palettes - qualitative, sequential, diverging.
  • Proximity: Place elements for primary comparison close together.
  • Change blindness: When focus is interrupted differences may not be noticed.

Pre-attentive

Can you find the odd one out?

Is it easier now?

Resources